library(plyr)
-------------------------------------------------------------------------------------------------------------------------
You have loaded plyr after dplyr - this is likely to cause problems.
If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
library(plyr); library(dplyr)
-------------------------------------------------------------------------------------------------------------------------
Attaching package: ‘plyr’
The following objects are masked from ‘package:plotly’:
arrange, mutate, rename, summarise
The following objects are masked from ‘package:dplyr’:
arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
The following object is masked from ‘package:purrr’:
compact
Read in data from files
Preparing/transforming the data into a usable form for analysis, visualization, etc.
Merging Dataframes
Merge the beer data with the breweries data. Print the first six and the last six observations to check the merged file.
attach(beer)
The following objects are masked from beer (pos = 3):
ABV, Beer_ID, Brewery_id, IBU, Name, Ounces, Style
The following objects are masked from beer (pos = 5):
ABV, Beer_ID, Brewery_id, IBU, Name, Ounces, Style
beer[order(Brewery_id),] # sort the data to determine column for merge
# merge on Brewery ID
breweries_named <- rename(breweries, c("Brew_ID"="Brewery_id"))
brewing_beer <- merge(breweries_named, beer, by = "Brewery_id", all = TRUE) # outer join
brewed_beer <- rename(brewing_beer, c("Name.x"="Brewery", "Name.y"="Beer")) # rename breweries and beer
head(brewed_beer, 6) # show the first 6 rows of data
tail(brewed_beer, 6) # show the last 6 rows of data
Missing data occur only in the ABV (62 values) and IBU (1005 values) columns. Two cleaning options are considered:
1. Keep complete records only.
2. Replace each NA with the average of the remaining values in its column.
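The two cleaning options can be sketched as follows. This is only an illustration on a small made-up data frame: `toy_beer`, `toy_complete`, and `toy_averaged` are hypothetical names, and the values are not taken from the beer data.

```r
# Toy stand-in for the merged data; all values are made up for illustration.
toy_beer <- data.frame(
  Beer = c("A", "B", "C", "D"),
  ABV  = c(0.05, NA, 0.07, 0.06),
  IBU  = c(40, 60, NA, NA)
)

# Option 1: keep only the rows that have no missing values.
toy_complete <- na.omit(toy_beer)

# Option 2: replace each NA with the mean of the observed values in its column.
toy_averaged <- toy_beer
for (col in c("ABV", "IBU")) {
  missing <- is.na(toy_averaged[[col]])
  toy_averaged[[col]][missing] <- mean(toy_averaged[[col]], na.rm = TRUE)
}

# Count the remaining NAs per column as a check.
colSums(is.na(toy_averaged))
```

Option 1 discards every beer with a missing IBU, which shrinks the sample considerably when many IBU values are absent. Option 2 keeps all rows but pulls the imputed values toward the column mean, which can understate the spread of the variable.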
colSums(is.na(averaged_beer))
Brewery_id Brewery City State Beer Beer_ID ABV IBU Style Ounces
0 0 0 0 0 0 0 0 0 0
Median of Alcohol by Volume and Bitterness by State
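The plotting code below reads from `median_df`, whose construction is not shown here. One way such a table could be built is with base R's `aggregate()`. This is a sketch on made-up data; `toy_beer` and `median_df_sketch` are hypothetical names and the values are illustrative only.

```r
# Toy data: a few beers in two states (made-up values).
toy_beer <- data.frame(
  State = c("CO", "CO", "CO", "TX", "TX"),
  ABV   = c(0.05, 0.07, 0.06, 0.04, 0.08),
  IBU   = c(30, 50, 70, 20, 40)
)

# One row per state, holding the median ABV and the median IBU.
# Note: the formula interface drops rows with NA in any listed column by default.
median_df_sketch <- aggregate(cbind(ABV, IBU) ~ State, data = toy_beer, FUN = median)

median_df_sketch
```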
ABV_bar <- ggplot(data = median_df, aes(x = State, y = ABV, fill = State)) +
  geom_bar(stat = "identity", width = 0.75) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Median ABV by State")
ggplotly(ABV_bar)
IBU_bar <- ggplot(data = median_df, aes(x = State, y = IBU, fill = State)) +
  geom_bar(stat = "identity", width = 0.75) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Median IBU by State")
ggplotly(IBU_bar)
Which state has the beer with the highest alcohol content (ABV)? Which state has the most bitter beer (IBU)?
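Both questions come down to finding the row with the largest value in a column. A minimal sketch with `which.max()` is shown here on made-up data; the state labels and values are placeholders, not results from the beer data.

```r
# Placeholder data: state labels and values are made up.
toy_beer <- data.frame(
  State = c("AA", "BB", "CC", "DD"),
  ABV   = c(0.099, 0.125, 0.080, 0.060),
  IBU   = c(90, 80, 70, 120)
)

# which.max() returns the index of the first maximum and ignores NA values.
strongest   <- toy_beer[which.max(toy_beer$ABV), c("State", "ABV")]
most_bitter <- toy_beer[which.max(toy_beer$IBU), c("State", "IBU")]

strongest
most_bitter
```

Because `which.max()` skips missing values, this lookup is best run on the data before any mean imputation, so that an imputed value cannot be reported as a maximum.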
The summary statistics and distribution of the ABV variable.
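A minimal sketch of how the summary statistics and the distribution could be produced with base R, assuming a numeric ABV vector. `toy_abv` is a hypothetical stand-in with made-up values; on the real data the `ABV` column of the merged data frame would take its place.

```r
# Made-up ABV values standing in for the real column.
toy_abv <- c(0.045, 0.050, 0.055, 0.060, 0.065, 0.070, 0.090)

# Five-number summary plus the mean.
abv_summary <- summary(toy_abv)
abv_summary

# Histogram of the distribution; the returned object holds the bin counts.
abv_hist <- hist(toy_abv, main = "Distribution of ABV", xlab = "ABV")
```

A mean noticeably above the median, together with a long right tail in the histogram, would indicate a right-skewed distribution.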
Is there an apparent relationship between the bitterness of the beer and its alcoholic content? Draw a scatter plot. Make your best judgment of a relationship and EXPLAIN your answer.
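One way to back up a visual judgment is to pair the scatter plot with a Pearson correlation test and a fitted line. The sketch below uses base R graphics and made-up data; `toy`, `ibu_abv_test`, and `ibu_abv_fit` are hypothetical names, and the numbers say nothing about the actual beer data.

```r
# Made-up IBU/ABV pairs for illustration only.
toy <- data.frame(
  IBU = c(10, 20, 30, 40, 50, 60, 70, 80),
  ABV = c(0.042, 0.045, 0.050, 0.052, 0.060, 0.061, 0.068, 0.075)
)

# Pearson correlation with a test of the null hypothesis of zero correlation.
ibu_abv_test <- cor.test(toy$IBU, toy$ABV)

# Simple linear fit, used here only to draw a trend line.
ibu_abv_fit <- lm(ABV ~ IBU, data = toy)

plot(toy$IBU, toy$ABV, xlab = "IBU", ylab = "ABV", main = "ABV vs. IBU")
abline(ibu_abv_fit)

ibu_abv_test$estimate  # correlation coefficient
ibu_abv_test$p.value   # evidence against zero correlation
```

A correlation describes association only; it does not show that bitterness causes higher alcohol content or the reverse.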
Budweiser would also like to investigate the difference with respect to IBU and ABV between IPAs (India Pale Ales) and other types of Ale (any beer with “Ale” in its name other than IPA). You decide to use KNN classification to investigate this relationship. Provide statistical evidence one way or the other. You can of course assume your audience is comfortable with percentages … KNN is very easy to understand conceptually.
In addition, while you have decided to use KNN to investigate this relationship (KNN is required) you may also feel free to supplement your response to this question with any other methods or techniques you have learned. Creativity and alternative solutions are always encouraged.
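A minimal sketch of the KNN step, assuming the `class` package (one of the recommended packages distributed with R) and a data frame with IBU, ABV, and a two-level label. The data, the fixed train/test split, and `k = 3` are all made up for illustration and are not the settings or results of this analysis.

```r
library(class)  # provides knn()

set.seed(1)  # knn() breaks ties at random

# Made-up beers: six labelled "IPA" and six labelled "Other Ale".
toy <- data.frame(
  IBU  = c(65, 70, 80, 75, 90, 68, 25, 30, 35, 20, 40, 28),
  ABV  = c(0.065, 0.070, 0.075, 0.068, 0.080, 0.066,
           0.050, 0.052, 0.055, 0.045, 0.058, 0.049),
  Type = factor(rep(c("IPA", "Other Ale"), each = 6))
)

# Standardize both features so that IBU (tens) does not dominate ABV (hundredths)
# in the distance calculation.
features <- scale(toy[, c("IBU", "ABV")])

# Hypothetical fixed split: 8 beers for training, 4 held out for testing.
train_idx <- c(1:4, 7:10)

pred <- knn(
  train = features[train_idx, ],
  test  = features[-train_idx, ],
  cl    = toy$Type[train_idx],
  k     = 3
)

# Confusion matrix and accuracy on the held-out beers.
conf_mat <- table(Predicted = pred, Actual = toy$Type[-train_idx])
accuracy <- mean(pred == toy$Type[-train_idx])

conf_mat
accuracy
```

On real data the split would normally be random and repeated, or replaced by cross-validation, with `k` chosen by comparing accuracy across several values. The accuracy can then be reported as a percentage, which suits the audience described above.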
Knock their socks off! Find one other useful inference from the data that you feel Budweiser may be able to find value in. You must convince them why it is important and back up your conviction with appropriate statistical evidence.